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ABSTRACT 

Numerical Simulation is an essential part of the design and optimisation of astronomical adap- 
tive optics systems. Simulations of adaptive optics are computationally expensive and the 
problem scales rapidly with telescope aperture size, as the required spatial order of the cor- 
recting system increases. Practical realistic simulations of AO systems for extremely large 
telescopes are beyond the capabilities of all but the largest of modern parallel supercomput- 
ers. Here we describe a more cost effective approach through the use of hardware acceleration 
using field programmable gate arrays. By transferring key parts of the simulation into pro- 
grammable logic, large increases in computational bandwidth can be expected. We show that 
the calculation of wavefront sensor image centroids can be accelerated by a factor of four by 
transferring the algorithm into hardware. Implementing more demanding parts of the adaptive 
optics simulation in hardware will lead to much greater performance improvements, of up to 
1000 times. 

Key words: instrumentation: adaptive optics - methods: numerical - techniques: miscella- 
neous - telescopes - instrumentation: high angular resolution 



1 INTRODUCTION 

Adaptive optics (AO) is a technology widely used in opti- 
cal and infra-red astronomy, and all large science telescopes 
have an AO system. A large number of results have been ob- 
tained using AO systems which would otherwise be impossible 
for se ei ng-limited observation s (see for example Gendr on et alJ 

(2004) ; Masciad ri et alJ l2005li ). New AO techniques are being 
studied for novel applicat ions suc h as wide-field high resolu- 
tion imagin g I Marche tti et all2004l) and extra-solar planet finding 
I Mouil let et all2004h . 

The simulation of an AO system is important to determine how 
well an AO system will perform. Such simulations are often neces- 
sary to determine whether a given AO system will meet its design 
requirements, thus allowing scientific goals to be met. Additionally, 
new concepts can be modelled, and the simulated perform ance of 
different AO techniques compared (see for example Verinaud et al. 

(2005) ), allowing informed decisions to be made when designing 
or upgrading an AO system and when optimising the system de- 
sign parameters. 

A full e nd-to-end AO simulation will typically involve several 
stages JCarbillet et alT2 005). Firstly, simulated atmospheric phase 
screens must be generated, to represent the atmospheric turbulence, 
often at different atmospheric heights. The aberrated complex wave 
amplitude at the telescope aperture is then generated by simulating 
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the atmospheric phase screens moving across the pupil. The wave- 
front at the pupil is then passed to the simulated AO control sys- 
tem, which will typically include one or more wavefront sensors 
and deformable mirrors and a feedback algorithm for closed loop 
operation. Additionally, one or more science point spread functions 
(PSFs) as seen through the AO system are calculated. Information 
about the AO system performance is computed from the PSF, in- 
cluding quantities such as the Strehl ratio and encircled energy. 

The computational requirements for AO simulation 
scale rapidly with telescope size, and so simulation of the 
largest telescopes cannot be done with out special tech niques, 
such as multiprocessor para llelisation llx Louarn et alJ 120041 : 
Ahmadia and Ellerbroek 2003) (which can suffer from IO band- 
width bottlenecks) or the use of analytical models IConan et alJ 
2003) (which can struggle to represent noise sources properly). 
We here describe a different approach to parallelisation using a 
massively parallel programmable hardware architecture. 



1.1 The Durham adaptive optics simulation platform 

At Durham University, we have bee n developing AO simulation 
codes for over ten years JPoellll995h . The code has recently been 
rewritten to take advantage of new hardware, new software tech- 
niques, and to allow much greater scalability for advanced simu- 
lation of extremely large telescopes (ELTs), multi-conjugate AO 
(MCAO) and extreme AO (XAO) systems <Russell et all2004l) . 
The Durham AO simulation platform uses the high level 
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Figure 1. A diagram showing an overview of the Cray XDl hardware. Six 
identical nodes containing two Opteron processors and an FPGA are con- 
nected via a high bandwidth interconnect. In the diagram, black arrows rep- 
resent bi-directional data transfer at 3.2 GBs -1 (1.6 GBs" 1 in each di- 
rection) and grey arrows represent bi-directional data transfer at 2 GBs -1 
(1 GBs -1 in each direction). The memory is connected with a bandwidth 
of 12.8 GBs -1 to the FPGA and 6.4 GBs" 1 to each CPU. The compo- 
nents labelled "SF" are the switching fabric processors. The FPGA is also 
connected directly to those in other nodes via high-bandwidth serial link. 



programming language Python to join together C, hardware and 
Python algorithms. This allows us to rapidly prototype and develop 
new and existing AO algorithms, and to prepare new AO system 
simulations quickly. The use of C and hardware algorithms ensures 
that processor intensive parts of the simulation platform can be im- 
plemented efficiently. 

The simulation software will run on most Unix-like operating 
systems, including Linux and Mac OS X. The simulation platform 
hardware at Durham consists of a Cray XDl supercomputer and a 
distributed cluster of conventional Unix workstations all connected 
by giga-bit Ethernet. For most simulation tasks, only the XDl is 
required, though for large models, or when multiple simulations 
are run simultaneously, the entire distributed cluster can be used. 

1.2 The Cray XDl supercomputer 

The Cray XDl supercomputer (Fig.0 is essentially a number of 
dual Opteron processing nodes (running at 2.2 GHz) connected 
by a high bandwidth interconnect allowing a maximum through- 
put of 1 GB/s between nodes in each direction lCravll2005h . At 
Durham, we have a single chassis XDl, which contains six such 
nodes. Within each node, we have 8 GB of CPU memory (400 MHz 
DDR memory). Each node also contains an application acceleration 
module with a Xilinx Virtex-II Pro field programmable gate array 
(FPGA) (which allows user logic to be clocked at up to 199 MHz) 
and 16 MB dual port SRAM connected to the FPGA (4x4 MB 
banks). This FPGA is connected directly to the high bandwidth in- 
terconnect and can be either bus master or slave. The FPGA can 
access the host Opteron memory with a bandwidth of 1.6 GBs" 1 in 
each direction, without need for processor intervention when it is 
bus master, and also access local SRAM memory with a bandwidth 
of 12.8 GBs" 1 . Additionally, a software process can write to reg- 
isters or memory within the FPGA and SRAM memory when the 
FPGA acts as a bus slave. 

The XDl operating system is an o ptimised version o f Suse 
Linux with improvements made by Cray I Brown. D. HI2004T) . The 



Data 3 



One task at a time: 
Taskl then Task2 then 
Task3 then Task4... 



Data 2 



►Data 1 



FPGA 



Data 7 



Many tasks in parallel: 
Taskl Task2 Task3 Task4 
Data 6*-Data 5»-Data 4*-Data 3HData 2 



►Data 1 



Figure 2. A diagram showing a comparison of a set of calculations per- 
formed in serial in a CPU and the same calculations implemented in a par- 
allel pipeline in an FPGA. The CPU carries out tasks sequentially to a given 
piece of data. The FPGA carries out many tasks simultaneously on different 
pieces of data. 



high bandwidth interconnect between nodes is designed for low la- 
tency communication, with MPI having a latency of only 1.6 fis 
and a sustainable bandwidth of 900 MB/s (in one direction) be- 
tween nodes, as measured on the Durham system. 



1.3 Programmable logic 

Often software algorithms will consist of a simple repetitive task 
that is performed many times, thus requiring large amounts of CPU 
processing time. If this task can be offloaded from the CPU to 
some application accelerator, the CPU is then free to carry out other 
tasks, thus reducing the total time to complete a calculation. Pro- 
grammable logic, in the form of an FPGA is ideal for use as an 
application accelerator. A simplistic view of an FPGA is that of a 
large amount of logic (AND, OR, NOT gates, latches, etc.) which 
can be connected in a user determined way. Logic can then be built 
up within the FPGA to perform simple (or even complicated) tasks. 

An FPGA will normally be programmed using a hardware de- 
scription language (HDL), such as VHDL. Once written by the 
user, code is synthesised and mapped into the FPGA. Common al- 
gorithms which can be placed into an FPGA include fast Fourier 
transforms (FFTs) and hardware control algorithms. 

An FPGA will typically be clocked at only a tenth of the speed 
of a CPU. However, by implementing many operations in parallel, 
they can give a performance improvement due to the high degree 
of parallelisation that they afford. Fig. [2] shows a comparison of a 
pipeline implemented in an FPGA and a CPU, and demonstrates 
the parallelisation available to an FPGA user. 

The XD 1 supercomputer at Durham contains six FPGAs, each 
with 53, 136 logic cells. These can be used to help accelerate 
the AO simulations. Currently only the centroid algorithm, part of 
the Shack-Hartmann wavefront sensor (SHWFS) has been imple- 
mented in hardware and this is able to give a performance improve- 
ment when compared with an optimised software implementation. 
The FPGA clock speeds are user selectable, up to an absolute max- 
imum of 199 MHz (set by Cray). The speed selected should be de- 
termined dependent on the user logic wi thin the FPGA, and FPGA 
vendor tools such as ISE <Xilinxll2005t) . give an estimate for the 
maximum clock speed that should be used with a given design and 
FPGA. 

The results of the performance increases obtained from us- 
ing the FPGAs are presented in the remainder of this paper, as are 
details of algorithms that will be implemented in hardware in the 
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near future, along with estimates of the performance improvements 
these will give. 



2 HARDWARE ACCELERATION 

As a first step we have implemented a centroid algorithm in the 
FPGAs. Although this is not the most demanding of applications 
for software, it maps well into hardware, and so is fairly straight- 
forward to implement. This has allowed us to test interface code in 
the FPGA which is used to read and write data from and to the host 
CPU memory, develop a generic software interface to the FPGA 
and will also give us some idea of the performance gains which are 
attainable with the FPGA. 

A SHWFS measures wavefront aberrations by measuring the 
wavefront gradient across the telescope aperture. This is done by 
dividing the aperture into a grid of sub-apertures using a lenslet 
array. The centroids of each sub-aperture image then represent the 
local wavefront slopes and this information can be used to shape 
the surface of a deformable mirror, thus removing or reducing the 
wavefront aberrations. 
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Figure 3. A diagram showing the time required to apply the centroid al- 
gorithm to a given amount of data. The performance of optimised C code 
(labelled software) running on the 2.2 GHz Opteron processors is shown 
along with the performance of the FPGA algorithm operating at different 
(labelled) clock speeds. The data to which the centroid algorithm is applied 
is composed of sub-apertures each with 16 pixels (4 X 4), with each pixel 
represented by 16 bits. The best fit lines are shown. 



2.1 Centroid algorithm 

The centroid algorithm that has been implemented in the FPGA 
reads sub-aperture image photon data from the host CPU main 
memory (requiring no CPU intervention), computes the x and y 
centroid positions of this sub-aperture and writes them back into 
the CPU main memory (again without CPU intervention). The sub- 
aperture image photon data is in the form of 16 bit unsigned integer 
numbers (typical of CCD pixel values), and the x and y centroid 
positions are returned in the form of 32 bit floating point numbers. 
These centroid positions are effectively the mean position of all 
the photons within the sub-aperture, and are computed within the 
FPGA by summing each individual photon count multiplied by its 
x or y position (in the range to N XiV — 1 where N X]V repre- 
sents the total number of pixels in the sub-aperture in the x or y 
direction), and dividing by the sum of all photons to give a fixed 
point number which is then converted into floating point. Finally, 
an offset equal to — ^ — is subtracted from the x and y centroid 
position so that the numbers returned to the host CPU have a range 
from — — ^ — - to — - and a position of represents the cen- 
tre of the sub-aperture. This FPGA implementation gives an exact 
agreement with that obtained using a traditional software centroid 
algorithm. 



2.1.1 User control 

User application software is used to control the hardware cen- 
troid algorithm. The user can specify the start address to read sub- 
aperture image data from, and the number of bytes of data to read, 
as well as the address to which the x and y centroids should be 
written. The user is also able to specify the number of pixels in 
the x and y directions within the sub-aperture and to start and stop 
the centroid calculation. Additionally, the user can specify a pixel 
weighting, which will map a 16 bit input pixel value to a weighted 
pixel value to be used in the centroid algorithm. Two weightings 
commonly used are to raise the pixel values to the power of 1.5 or 
two. 

This illustrates that although the FPGA is programmed with 



a fixed function, the user code that it implements can still be pro- 
grammable. When programming an FPGA there is a trade-off be- 
tween the flexibility of the algorithm and the time taken and FPGA 
resources used to implement the algorithm. 

Once the centroid calculation has been started, the FPGA will 
read data from the specified memory bank in host memory, and 
whenever enough data has been read, will calculate a centroid 
value, which will be returned to the host memory. It is therefore 
possible to compute the centroid locations of many sub-apertures 
with a single start command. 

2.2 Performance improvements 

The performance of the FPGA when calculating centroids has been 
compared with that of optimised C code by comparing the time that 
each takes to compute centroids from data in an array of given size. 
This allows us to compare performance when accessing both large 
and small data arrays, and with different sized sup-apertures. 

2.2.1 Processing time 

Fig.|3|shows the time taken by the FPGA and C code to apply the 
centroid algorithm to a given amount of data. Timings for several 
different FPGA clock speeds are shown. The logic we have imple- 
mented allows us to operate the FPGA at speeds up to 199 MHz. 
However, we have also investigated performance at lower clock 
speeds because future developments may necessitate the use of 
lower clock speeds, for example if the simulation of CCD read- 
out noise is implemented in the FPGA, and the logic is such that a 
slower clock speed is required. 

The gradient of the lines in Fig.|3|represent the time in seconds 
to apply the centroid algorithm to a given amount of data in bytes. 
The FPGA is able to transfer eight bytes of memory every clock 
cycle both into and out of the FPGA simultaneously, and due to 
the pipelined design of the FPGA centroid algorithm, data transfer 
should be the performance limiting factor. With a 100 MHz clock, 
the FPGA should be able to transfer data at a maximum rate of 
800 MBs -1 , corresponding to 1.25 ns per byte. This is equal to 
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Figure 4. A diagram showing the increase in performance when the cen- 
troid algorithm is computed in hardware instead of software as a function 
of FPGA clock speed, with the performance increase being the ratio of soft- 
ware computation time to hardware computation time. A sharp cutoff at 
1 70 MHz shows the point at which the data bus is saturated. A total of 2 MB 
of sub-aperture pixel data was centroided, with 16 pixels per sub-aperture 
and 16 bits per pixel. 

the gradient of the 100 MHz data on the diagram, meaning that 
near maximum performance is being achieved, and that this perfor- 
mance is bandwidth limited as expected. Similarly, for clock speeds 
up to 170 MHz, we find that the theoretical data transfer rate is 
matched by the gradients on the diagram. However, at clock speeds 
above 170 MHz, the computation time does not decrease further. 

A clock speed of 199 MHz should provide a maximum data 
transfer rate of 1592MBs _1 , corresponding to a data transfer time 
of about 0.625 ns per byte. However, as shown in Fig. [3] this is not 
achieved. 

Fig.|4|shows the performance gained when implementing the 
centroid algorithm in hardware as a function of FPGA clock speed. 
The sharp cutoff at 170 MHz, corresponding to a data transfer rate 
of 1.36 GBs -1 (8 bytes per clock cycle) occurs because this data 
transfer rate is approaching the theoretical maximum allowed be- 
tween the CPU and FPGA (1.592 GBs -1 ). The memory is accessed 
from the FPGA via the Opteron memory controller and a fabric 
switch (see Fig. 0, and these also have to arbitrate CPU access 
to the memory and communication between other computational 
nodes within the XD1 system, reducing performance below the the- 
oretical maximum. This suggests that maximum data transfer per- 
formance (and hence centroid performance) can be achieved with a 
FPGA clock speed of 170 MHz. Increasing the clock speed above 
this will not give a performance increase, and the centroid data pro- 
cessing pipeline in the FPGA will be idle for some clock cycles 
when no new data has arrived from the memory. 



2.2.2 FPGA latencies 

Fig. [5] shows the time taken to apply the centroid algorithm to a 
small amount of sub-aperture image photon data. The y axis inter- 
cepts for the FPGA data are equal to about 3-5 fis dependent on 
clock speed, and this represents the latencies involved when using 
the FPGAs. A functional simulation of the FPGA algorithm (using 
tools provided by the FPGA vendor) shows that the time between 
the last data arriving in the FPGA and leaving the FPGA should be 
of order 2-3 /is (dependent on clock speed), due to the length of the 
FPGA algorithm pipeline. The additional 1-2 fis latency are due to 
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Figure 5. A diagram showing the time required to apply the centroid algo- 
rithm to a given amount of data, as with Fig. [3] The solid black curve rep- 
resents the time taken by the software implementation while the grey lines 
represent the time taken by the FPGA implementation at 100 and 199 MHz. 
The grey lines are the same as those representing the same clock speed in 

Kg.ii 

the memory arbitration between the CPU and FPGA and time taken 
by software to start the FPGA operation. 

Fig.|5|shows that when applying the centroid algorithm to less 
than about 2-4 KB data (depending on clock speed), it is faster to 
compute the centroids using the CPU, as the FPGA overheads then 
begin to dominate. However, in a typical AO simulation, a very 
large number of centroids will be computed. 

2.2.3 Relative performance 

The most important reason for using FPGAs in AO simulation is for 
the improved performance that they are able to give. Fig.|6|shows a 
comparison of the time taken to calculate centroid positions using 
the FPGA and when using C code. 

When using FPGA clock speeds above 170 MHz, the centroid 
algorithm is up to four times faster in the FPGA when compared 
to the C algorithm for memory blocks greater than about 100 KB. 
For smaller memory blocks, the latency introduced when using the 
FPGA begins to have a larger effect and so the performance gain 
reduces. However, even when using a 4 KB memory block (two 
32 x 32 pixel sub-apertures or 128 4 x 4 pixel sub-apertures at 16 
bits per pixel), the FPGA still gives some performance gain with 
the software algorithm taking almost twice as long to perform the 
same computation. 

When applying the centroid algorithm to a dataset of a given 
size, the performance of the software algorithm will depend partly 
on the number of pixels in each sub-aperture, as shown in Fig.0 
However, the FPGA implementation does not have this depen- 
dence, since the pipeline architecture is independent of the sub- 
aperture size, with computation time depending only on the total 
amount of data and the small fixed pipeline latency. 



3 FUTURE CONSIDERATIONS AND IMPROVEMENTS 

The performance increase of four times when using the FPGAs 
may seem small, particularly when considering the amount of time 
required to write and debug the VHDL code for the FPGA. A flex- 
ible software centroid algorithm can be written in C taking only 
10-20 lines of code. However, when implemented in hardware, our 
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Figure 6. A diagram showing the increase in performance seen when using 
an FPGA centroid algorithm instead of a software algorithm. The perfor- 
mance gain is seen to level out when the centroid algorithm is applied to 
more than about 100 KB of sub-aperture pixel data. The performance in- 
crease is defined as the ratio of the time taken by software to the time taken 
by hardware when applying the centroid algorithm. The data to which the 
centroid algorithm is applied is composed of sub-apertures each with 16 
pixels, with each pixel represented by 16 bits. 
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Figure 7. A diagram showing the effect of sub-aperture size on centroid 
computation time for the software algorithm (dashed curve) and the hard- 
ware algorithm (solid line). A total of 2 MB data were input to the centroid 
algorithm in this case (1 MPixels). 

implementation contains about 2000 lines of code for host memory 
access (reading pixel values from host CPU memory, and writing 
the centroid values to host memory) and about 1200 lines of code 
for the centroid algorithm (of which half of this is responsible for 
fixed point to floating point conversion and pipeline flow control). 
This does not include the libraries used, which are treated as black 
boxes (for example Cray specific libraries, division libraries, and 
FPGA block memory access libraries), and for which, there is no 
source code available. 

However, only about a quarter of the FPGA has been used (a 
large fraction of which is the logic to read and write data from and 
to the CPU memory), and there is therefore room for extra logic 
to perform additional calculations. If these additional calculations 
are able to act on the data immediately before or after the cen- 
troid algorithm, they will take virtually no extra time to compute, 
since the limiting factor is the time taken to read data into and out 
of the FPGA. The performance gained over software implementa- 



tions will therefore be much greater, depending on the amount of 
calculation transferred into the FPGA pipeline. 

3.1 Pipeline extensions 

The current success of the FPGA algorithm is thus only the start 
of the hardware acceleration that can be achieved using the FPGAs 
and is limited by the rate at which data can be passed into and out 
of the FPGA. We aim to extend the number of algorithms that are 
placed within the hardware, including the algorithms related to the 
Monte-Carlo nature of the simulation. By ensuring that these algo- 
rithms are part of a single SHWFS pipeline, data transfers to and 
from host memory can be minimised, thus maximising the perfor- 
mance of the FPGA algorithms. The initial aim is to implement 
an AO simulation pipeline as shown in Fig. [8] This includes a 2D 
FFT of the input optical complex amplitude, high light level PSF 
computation (by taking the squared modulus of the output of the 2D 
FFT algorithm), the inclusion of sky background noise, photon shot 
noise, CCD readout noise, noise and background subtraction and 
finally the centroid algorithm. With such a pipeline, atmospheric 
pupil phase data will be read into the pipeline by the FPGA, and 
the centroid values will be written back to the host CPU memory. 
Due to the highly parallel nature of FPGAs, the total computation 
time will be virtually identical to that presented for the centroid 
algorithm here, with a slightly higher initial latency (which is neg- 
ligible when large amounts of data are involved). To perform such 
a pipeline in software takes much longer as each stage would be 
performed in series, with the total computation time being the sum 
of the times for each individual stage. 

3.2 Expected pipeline performance 

Once the pipeline shown in Fig.[5]has been implemented, the time 
to perform these calculations in the FPGA will be slightly greater 
than the times shown in Fig.|5| due to a slight increase (of order mi- 
croseconds) in the initial latency due to the longer pipeline. When 
considering the simulation of an AO system with 32 x 32 sub- 
apertures each with 8x8 atmospheric complex amplitude values 
(32 bit floating point), a total of 32 2 x 8 2 x 4 = 262144 bytes 
will need to be read into the FPGA, which with a 170 MHz clock 
will take approximately 200 fis. The current software simulation 
can perform this operation in approximately 0.25 s and so the per- 
formance will be increased by up to 1000 times, an improvement 
which would be difficult and expensive to achieve using only CPU 
based solutions. 

3.3 Separate algorithms 

In addition to implementing the pipeline in Fig.|2|in hardware, we 
also intend to implement several other algorithms in the FPGA: 

(i) Atmospheric phase screen generation 

(ii) Wavefront reconstructor algorithms 

(iii) Deformable mirror surface figure calculations 

(iv) Science PSF calculation 

These algorithms will help to improve the performance of the 
AO simulations further. Since they will not be integrated into the 
SHWFS pipeline, they must operate separately from the SHWFS 
pipeline and so may be placed in different FPGAs, thus having cer- 
tain nodes of the XD1 dedicated to certain parts of the AO simula- 
tion. 
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Figure 8. A diagram showing the pipeline stages which will eventually be 
implemented in the FPGA logic. Data will be read into the FPGA at the 
start of the pipeline, and the result transferred back to the host CPU at the 
end of the pipeline. 



4 CONCLUSION 

We have presented an overview of initial performance improve- 
ments made to the Durham AO simulation platform using pro- 
grammable logic (FPGAs). At present, only a centroid algorithm 
has been implemented in hardware, and this gives performance im- 
provements of up to a factor of four when compared to optimised C 
algorithms. By implementing more of the simulation in the FPGA 
we expect to increase the speed of AO simulation by up to 1000 
times. In this way we expect to implement realistic simulations of 
ELT-scale AO systems on a single Cray XD1 platform, at speeds 
that will be useful for detailed design and optimisation studies of 
future instruments. 
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